Solution: Strictly Follow Referencing Rules
Let’s find out the ways to clear out the Pseudo key Neat-Freak antipattern.
We'll cover the following
The values in any primary key must be unique and non-null so we can use them to reference individual rows, but that’s the only rule — they don’t have to be consecutive numbers to identify rows.
Numbering rows#
Most pseudo key generators return numbers that almost look like row numbers because they’re monotonically increasing (that is, each successive value is one greater than the preceding value), but this is only a coincidence of their implementation. Generating values in this way is a convenient way to ensure uniqueness.
We mustn’t confuse row numbers with primary keys. A primary key identifies one row in one table, whereas row numbers identify rows in a result set. Row numbers in a query result set don’t correspond to primary key values in the table, especially when we use query operations like JOIN
, GROUP BY
, or ORDER BY
.
There are good reasons to use row numbers, for example, to return a subset of rows from a query result. This is often called pagination, like in the pages resulting from an Internet search. To select a subset in this way, we need to use true row numbers that are increasing and consecutive, regardless of the form of the query.
SQL:2003 specifies “window functions”, which include ROW_NUMBER()
. This returns consecutive numbers that are specific to a query result set. A common use of row numbering is to limit the query result to a range of rows:
/
- main.sql
These functions are currently supported by many leading brands of databases, including Oracle, Microsoft SQL Server 2005, IBM DB2, PostgreSQL 8.4, and Apache Derby.
MySQL, SQLite, Firebird, and Informix don’t support SQL:2003 window functions, but they have proprietary syntax we can use in the scenario presented in this section. MySQL and SQLite support a LIMIT
clause, and Firebird and Informix support a query option with keywords FIRST
and SKIP
.
Using GUIDs#
We could also generate random pseudo key values, as long as we don’t use any number more than once. Some databases support a globally unique identifier for this purpose.
A globally unique identifier (GUID) is a pseudorandom number of 128 bits (usually represented by 32 hexadecimal digits). For practical purposes, a GUID is unique, so we can use it to generate a pseudo key.
The following example uses Microsoft SQL Server 2005 syntax:
This creates a row like the following:
bug_id | summary |
---|---|
0xff19966f868b11d0b42d00c04fc964ff | Crashes when I save |
We gain at least two advantages over traditional pseudo key generators when we use GUIDs:
-
We can generate pseudo keys on multiple database servers concurrently without using the same values.
-
No one will complain about gaps — they’ll be too busy complaining about typing thirty-two hex digits for primary key values.
The latter point leads to some of the disadvantages:
-
The values are long and hard to type.
-
The values are random, so we can’t infer any pattern, such as relying on a greater value to indicate a more recent row.
-
Storing a GUID requires 16 bytes. This takes more space and runs more slowly than using a typical 4-byte integer pseudo key.
The most important problem#
Now that we know the problems caused by renumbering pseudo keys and some alternative solutions for related goals, we still have one big problem to solve: how do we fend off an order from a boss who wants us to tidy up the database by closing the gaps in a pseudo key? This is a problem of communication, not technology. Nevertheless, we might need to manage our manager to defend the data integrity of our database. Here are a few tips that normally work:
-
Explain the technology. Honesty is usually the best policy. Be respectful and acknowledge the feeling behind the request. For example, we can tell our manager:
“The gaps do look strange, but they’re harmless. It’s normal for rows to be skipped, rolled back, or deleted from time to time. We allocate a new number for each new row in the database instead of writing code to figure out which old numbers we can reuse safely. This makes our code cheap to develop, makes it faster to run, and reduces errors.” -
Be clear about the costs. Changing the primary key values seems like a trivial task. Still, we should give realistic estimates for the work it will take to calculate new values, write and test code for handling duplicate values, cascade changes throughout the database, investigate the impact on other systems, and train users and administrators to manage the new procedures.
Most managers prioritize based on the cost of a task, so they are unlikely to insist on insignificant, micro-optimizing work when they’re confronted with the real cost. -
Use natural keys. If our manager or other users of the database insist on interpreting meaning in the primary key values, then we can let there be meaning. Instead of using pseudo keys, use a string or a number that encodes some identifying meaning. Then it’s easier to explain any gaps within the context of the meaning of these natural keys.
We can also use both a pseudo key and another attribute column we use as a natural identifier. Hide the pseudo key from reports if gaps in the numeric sequence make readers anxious.